Cosmos-Surg-dVRK: World Foundation Model-based Automated Online Evaluation of Surgical Robot Policy Learning

Zbinden, Lukas, Nelson, Nigel, Chen, Juo-Tung, Chen, Xinhao, Kim, Ji Woong, Azizian, Mahdi, Krieger, Axel, Huver, Sean

arXiv.org Artificial Intelligence

We propose a framework for automated policy evaluation using Cosmos-Surg-dVRK, a Cosmos world foundation model (WFM) finetune, to perform simulated surgical policy rollouts and subsequent automated success rate evaluation using a video classifier. After rollout completion, the generated video is automatically labeled for task success or failure using a trained video classifier, enabling objective selection of the most promising surgical policies for real-robot evaluation and deployment. The rise of surgical robots and vision-language-action models has accelerated the development of autonomous surgical policies and efficient assessment strategies. However, evaluating these policies directly on physical robotic platforms such as the da Vinci Research Kit (dVRK) remains hindered by high costs, time demands, reproducibility challenges, and variability in execution. World foundation models (WFMs) for physical AI offer a transformative approach to simulating complex real-world surgical tasks, such as soft tissue deformation, with high fidelity. This work introduces Cosmos-Surg-dVRK, a surgical finetune of the Cosmos WFM, which, together with a trained video classifier, enables fully automated online evaluation and benchmarking of surgical policies. On tabletop suture pad tasks, the automated pipeline achieves strong correlation between online rollouts in Cosmos-Surg-dVRK and policy outcomes on the real dVRK Si platform, as well as good agreement between human labelers and the V-JEPA 2-derived video classifier. Additionally, preliminary experiments with ex-vivo porcine cholecystectomy tasks in Cosmos-Surg-dVRK demonstrate promising alignment with real-world evaluations, highlighting the platform's potential for more complex surgical procedures.

World models have emerged as a foundational approach for enabling intelligent agents to understand and act in complex simulated environments. Building on early work by Ha & Schmidhuber (2018), they learn compact latent representations of spatial and temporal dynamics and predict the consequences of actions in context. Leveraging diffusion processes, recent efforts have introduced large-scale multi-modal world foundation models (WFMs) (Wan et al., 2025; Agarwal et al., 2025; Kong et al., 2024; Xie et al., 2025; Ball et al., 2025). These video generative models serve as scalable, general-purpose learned simulators that encode and synthesize diverse physical phenomena, approximate scene dynamics, render plausible sensory observations, and facilitate policy evaluation and training, enabling progress on sim-to-real transfer. At the intersection of these advances lies the burgeoning field of physical AI: AI systems equipped with sensors and actuators that enable perception, reasoning, and actuation in the physical world.
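
As a rough illustration of the evaluation loop this abstract describes, the sketch below rolls a policy out inside a generative world model and scores the resulting video with a success classifier. The interfaces for the world model and the classifier are assumptions standing in for Cosmos-Surg-dVRK and the V-JEPA 2-derived classifier, not the authors' actual API.

```python
# Hypothetical sketch of the automated evaluation pipeline; all interfaces
# are assumptions, not the Cosmos-Surg-dVRK API.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np

@dataclass
class RolloutResult:
    video: np.ndarray   # (T, H, W, 3) generated frames
    success: bool       # video classifier verdict

def evaluate_policy(
    policy: Callable[[np.ndarray], np.ndarray],            # frame -> action
    world_model_step: Callable[[np.ndarray], np.ndarray],  # action -> next frame
    classify_success: Callable[[np.ndarray], bool],        # video -> success?
    initial_frame: np.ndarray,
    horizon: int = 200,
    n_rollouts: int = 50,
) -> float:
    """Roll the policy out inside the world model and return its success rate."""
    results: List[RolloutResult] = []
    for _ in range(n_rollouts):
        frames = [initial_frame]
        for _ in range(horizon):
            action = policy(frames[-1])               # act on the latest frame
            frames.append(world_model_step(action))   # world model predicts the next
                                                      # frame (conditioning simplified)
        video = np.stack(frames)
        results.append(RolloutResult(video, classify_success(video)))
    return sum(r.success for r in results) / len(results)
```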


Human-Robot collaboration in surgery: Advances and challenges towards autonomous surgical assistants

Colan, Jacinto, Davila, Ana, Yamada, Yutaro, Hasegawa, Yasuhisa

arXiv.org Artificial Intelligence

This work has been accepted at the 2025 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) and submitted to the IEEE for possible publication. Human-robot collaboration in surgery represents a significant area of research, driven by the increasing capability of autonomous robotic systems to assist surgeons in complex procedures. This systematic review examines the advancements and persistent challenges in the development of autonomous surgical robotic assistants (ASARs), focusing specifically on scenarios where robots provide meaningful and active support to human surgeons. Adhering to the PRISMA guidelines, a comprehensive literature search was conducted across the IEEE Xplore, Scopus, and Web of Science databases, resulting in the selection of 32 studies for detailed analysis. Two primary collaborative setups were identified: teleoperation-based assistance and direct hands-on interaction. The findings reveal a growing research emphasis on ASARs, with predominant applications currently in endoscope guidance, alongside emerging progress in autonomous tool manipulation. Several key challenges hinder wider adoption, including the alignment of robotic actions with human surgeon preferences, the necessity for procedural awareness within autonomous systems, the establishment of seamless human-robot information exchange, and the complexities of skill acquisition in shared workspaces. This review synthesizes current trends, identifies critical limitations, and outlines future research directions essential to improve the reliability, safety, and effectiveness of human-robot collaboration in surgical environments. I. INTRODUCTION Surgical robotics has substantially reshaped modern operative workflows; however, current systems operate primarily under direct teleoperated control, thereby limiting their potential as truly collaborative partners.


Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition

Baby, Britty, Srivastav, Vinkle, Jain, Pooja P., Yuan, Kun, Mascagni, Pietro, Padoy, Nicolas

arXiv.org Artificial Intelligence

The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing CVS criteria remains a complex and challenging task, even for experts. Traditional models for CVS recognition depend on vision-only models learning with costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained, binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet's multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods that help in analyzing the image-text alignment. While further work is needed to match state-of-the-art spatial annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. Code: https://github.com/CAMMA-public/CVS-AdaptNet
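
To make the positive/negative prompt idea concrete, here is a minimal CLIP-style scoring sketch: each CVS criterion is decided independently by comparing the image embedding against an "achieved" and a "not achieved" text embedding. The prompt wording and the binary-softmax rule are assumptions; CVS-AdaptNet adapts PeskaVLP rather than this generic recipe.

```python
# Minimal sketch of multi-label recognition via positive/negative prompts.
# Prompt wording and scoring rule are assumptions, not CVS-AdaptNet itself.
import torch

# Hypothetical prompt pair for one CVS criterion; the paper's wording differs.
PROMPTS = {
    "cystic plate": ("the cystic plate is clearly visible",
                     "the cystic plate is not visible"),
}

def cvs_scores(image_emb: torch.Tensor,
               pos_emb: torch.Tensor,
               neg_emb: torch.Tensor) -> torch.Tensor:
    """Per-criterion probability that the criterion is achieved.

    image_emb: (D,) L2-normalized image embedding
    pos_emb, neg_emb: (C, D) normalized embeddings of the positive/negative
    prompt for each of the C criteria
    """
    pos_sim = pos_emb @ image_emb   # (C,) similarity to "achieved" text
    neg_sim = neg_emb @ image_emb   # (C,) similarity to "not achieved" text
    # Binary softmax per criterion: each label is decided independently,
    # which is what makes the formulation multi-label rather than multi-class.
    return torch.softmax(torch.stack([neg_sim, pos_sim], dim=-1), dim=-1)[:, 1]
```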


Implementation and Assessment of an Augmented Training Curriculum for Surgical Robotics

Rota, Alberto, Fan, Ke, De Momi, Elena

arXiv.org Artificial Intelligence

The integration of high-level assistance algorithms in surgical robotics training curricula may be beneficial in establishing a more comprehensive and robust skillset for aspiring surgeons, improving their clinical performance as a consequence. This work presents the development and validation of a haptic-enhanced Virtual Reality simulator for surgical robotics training, featuring 8 surgical tasks that the trainee can interact with thanks to the embedded physics engine. This virtual simulated environment is augmented by the introduction of high-level haptic interfaces for robotic assistance that aim at re-directing the motion of the trainee's hands and wrists toward targets or away from obstacles, and providing a quantitative performance score after the execution of each training exercise. An experimental study shows that the introduction of enhanced robotic assistance into a surgical robotics training curriculum improves performance during the training process and, crucially, promotes the transfer of the acquired skills to an unassisted surgical scenario, such as the clinical one. The increase in surgical robotics procedures over the last decade demands a large number of trained surgeons [1], [2], capable of teleoperating such advanced and complex systems while also taking advantage of the benefits of Robot-Assisted Minimally Invasive Surgery (RAMIS) safely and effectively.
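
A common way to realize the kind of haptic guidance described here is a potential field that pulls the hand toward the target and pushes it away from obstacles; the sketch below uses Khatib-style attractive/repulsive forces. The gains and falloff are illustrative assumptions, not the simulator's actual control law.

```python
# Illustrative potential-field guidance force; gains and falloff are
# assumptions, not the paper's haptic rendering law.
import numpy as np

def guidance_force(hand: np.ndarray, target: np.ndarray,
                   obstacles: list[np.ndarray],
                   k_attract: float = 5.0, k_repel: float = 0.5,
                   influence_radius: float = 0.05) -> np.ndarray:
    """Return a 3D force to render on the haptic device."""
    force = k_attract * (target - hand)          # spring pull toward the target
    for obs in obstacles:
        offset = hand - obs
        d = float(np.linalg.norm(offset))
        if 0.0 < d < influence_radius:           # repel only inside the influence zone
            force += k_repel * (1.0 / d - 1.0 / influence_radius) / d**2 * (offset / d)
    return force
```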


SuFIA-BC: Generating High Quality Demonstration Data for Visuomotor Policy Learning in Surgical Subtasks

Moghani, Masoud, Nelson, Nigel, Ghanem, Mohamed, Diaz-Pinto, Andres, Hari, Kush, Azizian, Mahdi, Goldberg, Ken, Huver, Sean, Garg, Animesh

arXiv.org Artificial Intelligence

Behavior cloning facilitates the learning of dexterous manipulation skills, yet the complexity of surgical environments, the difficulty and expense of obtaining patient data, and robot calibration errors present unique challenges for surgical robot learning. We provide an enhanced surgical digital twin with photorealistic human anatomical organs, integrated into a comprehensive simulator designed to generate high-quality synthetic data to solve fundamental tasks in surgical autonomy. We present SuFIA-BC: visual Behavior Cloning policies for Surgical First Interactive Autonomy Assistants. We investigate visual observation spaces including multi-view cameras and 3D visual representations extracted from a single endoscopic camera view. Through systematic evaluation, we find that the diverse set of photorealistic surgical tasks introduced in this work enables a comprehensive evaluation of prospective behavior cloning models for the unique challenges posed by surgical environments. We observe that current state-of-the-art behavior cloning techniques struggle to solve the contact-rich and complex tasks evaluated in this work, regardless of their underlying perception or control architectures. These findings highlight the importance of customizing perception pipelines and control architectures, as well as curating larger-scale synthetic datasets that meet the specific demands of surgical tasks. Project website: https://orbit-surgical.github.io/sufia-bc/
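
For readers unfamiliar with the baseline objective being evaluated, the sketch below shows a generic single-view behavior-cloning setup: a small visual encoder regresses expert actions under an L2 loss. The architecture and hyperparameters are assumptions; SuFIA-BC compares several perception and control architectures rather than this minimal policy.

```python
# Generic behavior-cloning sketch; the architecture is an assumption, not
# any of the SuFIA-BC policies.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisuomotorPolicy(nn.Module):
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(                 # tiny CNN image encoder
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)         # regress robot actions

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(obs))

def bc_step(policy, optimizer, obs, expert_actions) -> float:
    """One gradient step on the L2 imitation loss."""
    loss = F.mse_loss(policy(obs), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```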


FF-SRL: High Performance GPU-Based Surgical Simulation For Robot Learning

Dall'Alba, Diego, Nasket, Michał, Kaminska, Sabina, Korzeniowski, Przemysław

arXiv.org Artificial Intelligence

Robotic surgery is a rapidly developing field that can greatly benefit from the automation of surgical tasks. However, training techniques such as Reinforcement Learning (RL) require a high number of task repetitions, which are generally unsafe and impractical to perform on real surgical systems. This stresses the need for simulated surgical environments that are not only realistic, but also computationally efficient and scalable. We introduce FF-SRL (Fast and Flexible Surgical Reinforcement Learning), a high-performance learning environment for robotic surgery. In FF-SRL, both physics simulation and RL policy training reside entirely on a single GPU. This avoids typical bottlenecks associated with data transfer between the CPU and GPU, leading to accelerated learning rates. Our results show that FF-SRL reduces the training time of a complex tissue manipulation task by an order of magnitude, down to a couple of minutes, compared to a common CPU/GPU simulator. Such a speed-up may facilitate experimentation with RL techniques and contribute to the development of a new generation of surgical systems. To this end, we make our code publicly available to the community.
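
The performance argument rests on keeping simulation and learning on one device, as in the toy sketch below: thousands of environments are stepped as a single batched tensor operation, so observations and rewards never cross the CPU-GPU boundary. The dynamics here are a placeholder assumption; FF-SRL runs a full soft-tissue physics engine on the GPU.

```python
# Toy single-GPU batched environment; the dynamics are a stand-in for
# FF-SRL's GPU physics engine.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
n_envs, obs_dim, act_dim = 4096, 16, 4

state = torch.zeros(n_envs, obs_dim, device=device)
proj = torch.randn(act_dim, obs_dim, device=device)    # fixed toy dynamics

def step(actions: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Advance all environments in parallel; no CPU round-trip."""
    global state
    state = state + 0.01 * torch.tanh(actions) @ proj  # batched transition
    reward = -state.norm(dim=-1)                       # toy reward signal
    return state, reward
```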


Diffusion Stabilizer Policy for Automated Surgical Robot Manipulations

Ho, Chonlam, Hu, Jianshu, Wang, Hesheng, Dou, Qi, Ban, Yutong

arXiv.org Artificial Intelligence

Intelligent surgical robots have the potential to revolutionize clinical practice by enabling more precise and automated surgical procedures. However, the automation of such robots for surgical tasks remains under-explored compared to recent advancements in solving household manipulation tasks. These successes have been largely driven by (1) advanced models, such as transformers and diffusion models, and (2) large-scale data utilization. Aiming to extend these successes to the domain of surgical robotics, we propose a diffusion-based policy learning framework, called Diffusion Stabilizer Policy (DSP), which enables training with imperfect or even failed trajectories. Our approach consists of two stages: first, we train the diffusion stabilizer policy using only clean data. Then, the policy is continuously updated using a mixture of clean and perturbed data, with filtering based on the prediction error on actions. Comprehensive experiments conducted in various surgical environments demonstrate the superior performance of our method in perturbation-free settings and its robustness when handling perturbed demonstrations.
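
The filtering step in the second stage can be pictured as follows: the current policy scores every perturbed demonstration by its action prediction error and keeps only the low-error pairs. The hard threshold rule below is an assumption about how "filtering based on the prediction error on actions" is realized.

```python
# Sketch of prediction-error filtering; the threshold rule is an assumption,
# not necessarily DSP's exact criterion.
import torch

def filter_demos(policy, observations, actions, threshold: float):
    """Keep (obs, action) pairs whose per-sample action MSE is below threshold."""
    with torch.no_grad():
        errors = (policy(observations) - actions).pow(2).mean(dim=-1)  # per-sample MSE
    keep = errors < threshold
    return observations[keep], actions[keep]
```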


SurgIRL: Towards Life-Long Learning for Surgical Automation by Incremental Reinforcement Learning

Ho, Yun-Jie, Chiu, Zih-Yun, Zhi, Yuheng, Yip, Michael C.

arXiv.org Artificial Intelligence

Surgical automation holds immense potential to improve the outcome and accessibility of surgery. Recent studies use reinforcement learning to learn policies that automate different surgical tasks. However, these policies are developed independently and are limited in their reusability when the task changes, making it more time-consuming when robots learn to solve multiple tasks. Inspired by how human surgeons build their expertise, we train surgical automation policies through Surgical Incremental Reinforcement Learning (SurgIRL). SurgIRL aims to (1) acquire new skills by referring to external policies (knowledge) and (2) accumulate and reuse these skills to solve multiple unseen tasks incrementally (incremental learning). Our SurgIRL framework includes three major components. We first define an expandable knowledge set containing heterogeneous policies that can be helpful for surgical tasks. Then, we propose Knowledge Inclusive Attention Network with mAximum Coverage Exploration (KIAN-ACE), which improves learning efficiency by maximizing the coverage of the knowledge set during the exploration process. Finally, we develop incremental learning pipelines based on KIAN-ACE to accumulate and reuse learned knowledge and solve multiple surgical tasks sequentially. Our simulation experiments show that KIAN-ACE efficiently learns to automate ten surgical tasks separately or incrementally. We also evaluate our learned policies on the da Vinci Research Kit (dVRK) and demonstrate successful sim-to-real transfers.
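
One way to picture the knowledge-inclusive attention at the core of KIAN-ACE is a state-conditioned softmax over a frozen set of external policies that mixes their action proposals. The sketch below shows only that mixing; the maximum-coverage exploration bonus is omitted, and the architecture is an assumption rather than the paper's exact network.

```python
# Hypothetical attention over a knowledge set of policies; not the exact
# KIAN-ACE architecture, and the exploration bonus is omitted.
import torch
import torch.nn as nn

class KnowledgeAttentionPolicy(nn.Module):
    def __init__(self, knowledge_policies: list, state_dim: int):
        super().__init__()
        self.knowledge = nn.ModuleList(knowledge_policies)  # frozen external policies
        self.scorer = nn.Linear(state_dim, len(knowledge_policies))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.scorer(state), dim=-1)               # (B, K)
        actions = torch.stack([p(state) for p in self.knowledge], dim=1)  # (B, K, A)
        return (weights.unsqueeze(-1) * actions).sum(dim=1)               # action mixture
```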


Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks

Kim, Ji Woong, Zhao, Tony Z., Schmidgall, Samuel, Deguet, Anton, Kobilarov, Marin, Finn, Chelsea, Krieger, Axel

arXiv.org Artificial Intelligence

We explore whether surgical manipulation tasks can be learned on the da Vinci robot via imitation learning. However, the da Vinci system presents unique challenges that hinder straightforward implementation of imitation learning. Notably, its forward kinematics is inconsistent due to imprecise joint measurements, and naively training a policy using such approximate kinematics data often leads to task failure. To overcome this limitation, we introduce a relative action formulation which enables successful policy training and deployment using its approximate kinematics data. A promising outcome of this approach is that the large repository of clinical data, which contains approximate kinematics, may be directly utilized for robot learning without further corrections. We demonstrate our findings through successful execution of three fundamental surgical tasks: tissue manipulation, needle handling, and knot-tying.
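
The relative action formulation can be made concrete with a small pose-arithmetic sketch: targets are expressed in the current tool frame, so a constant bias in the approximate forward kinematics largely cancels when the action is applied. The position-plus-rotation-matrix parameterization below is a simplification and an assumption about SRT's exact formulation.

```python
# Simplified relative-action arithmetic; the exact SE(3) parameterization
# used by SRT may differ.
import numpy as np

def relative_action(cur_pos, cur_rot, target_pos, target_rot):
    """Express a target pose in the frame of the current end-effector pose."""
    dpos = cur_rot.T @ (target_pos - cur_pos)   # translation in the tool frame
    drot = cur_rot.T @ target_rot               # rotation in the tool frame
    return dpos, drot

def apply_action(cur_pos, cur_rot, dpos, drot):
    """Recover the commanded absolute pose from a relative action."""
    return cur_pos + cur_rot @ dpos, cur_rot @ drot
```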


Task segmentation based on transition state clustering for surgical robot assistance

Yamada, Yutaro, Colan, Jacinto, Davila, Ana, Hasegawa, Yasuhisa

arXiv.org Artificial Intelligence

Understanding surgical tasks represents an important challenge for autonomy in surgical robotic systems. To achieve this, we propose an online task segmentation framework that uses hierarchical transition state clustering to activate predefined robot assistance. Our approach involves performing a first clustering on visual features and a subsequent clustering on robot kinematic features within each visual cluster. This enables capturing relevant task transition information from each modality independently. The approach is implemented for a pick-and-place task commonly found in surgical training. Validation of the transition segmentation showed high accuracy and fast computation time. We have integrated the transition recognition module with predefined robot-assisted tool positioning. The complete framework has shown benefits in reducing task completion time and cognitive workload.
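
The two-level clustering is straightforward to sketch: frames are first clustered on visual features, then each visual cluster is sub-clustered on kinematic features. KMeans and the cluster counts below are assumptions; the paper's clustering algorithm may differ.

```python
# Sketch of hierarchical transition-state clustering; KMeans and the
# cluster counts are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_transition_clusters(visual_feats: np.ndarray,
                                     kinematic_feats: np.ndarray,
                                     n_visual: int = 5, n_kin: int = 3):
    """Return (visual_label, kinematic_sub_label) for each frame."""
    vis_labels = KMeans(n_clusters=n_visual, n_init=10).fit_predict(visual_feats)
    kin_labels = np.zeros_like(vis_labels)
    for c in range(n_visual):
        mask = vis_labels == c
        if mask.sum() >= n_kin:   # sub-cluster only when there are enough frames
            kin_labels[mask] = KMeans(n_clusters=n_kin, n_init=10).fit_predict(
                kinematic_feats[mask])
    return vis_labels, kin_labels
```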